
[trainer/deepspeed] load_best_model (reimplement re-init) #17151

Merged
stas00 merged 8 commits into main from ds-load_best_model
Jun 2, 2022

Conversation

@stas00
Contributor

@stas00 stas00 commented May 10, 2022

This PR fixes #17114

The deepspeed_reinit hack proved not to work in all cases: some of the stored args turn out to be stale or wrong (e.g. the optimizer can end up being a DeepSpeed outer optimizer, which shouldn't be the case), so this PR instead does a full init from scratch.
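
As a rough, hedged sketch of the full re-init approach (the helper name deepspeed_init, its arguments, and the Trainer attributes touched here are assumptions for illustration, not the exact merged diff):

```python
# Hedged sketch: rebuild the DeepSpeed engine from scratch when loading the
# best model, instead of reusing stored (possibly stale/wrong) init args.
# `deepspeed_init` and the attribute names below are assumptions for
# illustration, not the exact code that was merged.
from transformers.deepspeed import deepspeed_init

def _load_best_model_with_deepspeed(trainer):
    # full re-init: fresh engine/optimizer/scheduler, resumed from the best
    # checkpoint, rather than patching an engine that has already stepped
    deepspeed_engine, optimizer, lr_scheduler = deepspeed_init(
        trainer,
        num_training_steps=trainer.args.max_steps,
        resume_from_checkpoint=trainer.state.best_model_checkpoint,
    )
    trainer.model = deepspeed_engine.module
    trainer.model_wrapped = deepspeed_engine
    trainer.deepspeed = deepspeed_engine
    trainer.optimizer = optimizer
    trainer.lr_scheduler = lr_scheduler
```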

There was also an issue on the DeepSpeed side with the model getting DeepSpeed hooks added multiple times, which was breaking everything. Fixed in deepspeedai/DeepSpeed#1947.

I spent many hours trying to reproduce the problem in the usual way via the example scripts to make a test, but alas, it just wouldn't fail in the right places. So I ended up re-implementing test_load_best_model using code derived from @base-y's repro example. Much appreciated having their script.
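
For context, a minimal sketch of what such a standalone load_best_model test can look like (the tiny model name, dataset handling, config file, and step counts below are illustrative assumptions, not the exact test that landed):

```python
# Illustrative sketch of exercising load_best_model_at_end under DeepSpeed.
# The model name, ds_config file, and step counts are assumptions.
from transformers import (
    AutoModelForSequenceClassification,
    Trainer,
    TrainingArguments,
)

def run_load_best_model_repro(train_dataset, eval_dataset, ds_config_path, output_dir):
    model = AutoModelForSequenceClassification.from_pretrained(
        "hf-internal-testing/tiny-random-bert"  # hypothetical tiny model
    )
    args = TrainingArguments(
        output_dir=output_dir,
        deepspeed=ds_config_path,        # e.g. a ZeRO-3 config, as in #17114
        evaluation_strategy="steps",
        eval_steps=5,
        save_steps=5,
        max_steps=10,
        load_best_model_at_end=True,     # the code path this PR fixes
        per_device_train_batch_size=2,
    )
    trainer = Trainer(
        model=model, args=args,
        train_dataset=train_dataset, eval_dataset=eval_dataset,
    )
    trainer.train()   # must not fail when the best model is re-loaded at the end
    return trainer.evaluate()
```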

Blocking events

Fixes: #17114

@HuggingFaceDocBuilderDev

HuggingFaceDocBuilderDev commented May 10, 2022

The documentation is not available anymore as the PR was closed or merged.

@dumpmemory
Contributor

When will this PR be merged?

@stas00
Contributor Author

stas00 commented Jun 2, 2022

We needed to wait for a new deepspeed release, which I see has happened already, so yes, we can merge this shortly.
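
As a hedged aside, the minimum-version requirement bumped in this PR (deepspeed>=0.6.5, per the commit list below) could be checked along these lines (a generic sketch, not the exact check transformers uses):

```python
# Sketch: verify the installed DeepSpeed carries the fix from
# deepspeedai/DeepSpeed#1947 by requiring deepspeed>=0.6.5.
import importlib.metadata
from packaging import version

def require_min_deepspeed(minimum: str = "0.6.5") -> None:
    installed = version.parse(importlib.metadata.version("deepspeed"))
    if installed < version.parse(minimum):
        raise ImportError(f"deepspeed>={minimum} is required, found {installed}")
```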

@stas00 stas00 changed the title from [WIP] [trainer/deepspeed] load_best_model to [trainer/deepspeed] load_best_model (reimplement re-init) Jun 2, 2022
@stas00 stas00 marked this pull request as ready for review June 2, 2022 15:51
@stas00 stas00 requested a review from sgugger June 2, 2022 15:51
Collaborator

@sgugger sgugger left a comment

LGTM!

@stas00 stas00 merged commit 2f59ad1 into main Jun 2, 2022
@stas00 stas00 deleted the ds-load_best_model branch June 2, 2022 16:14
Narsil pushed a commit to Narsil/transformers that referenced this pull request Jun 7, 2022
…e#17151)

* [trainer/deepspeed] load_best_model

* to sync with DS PR huggingface#1947

* simplify

* rework load_best_model test

* cleanup

* bump deepspeed>=0.6.5

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
elusenji pushed a commit to elusenji/transformers that referenced this pull request Jun 12, 2022
amyeroberts pushed a commit to amyeroberts/transformers that referenced this pull request Jun 16, 2022


Development

Successfully merging this pull request may close these issues.

Error on loading saved optimizer after training (zero-3)
